Module 11: Handling Scale in Serverless Applications

Developing Serverless Solutions on AWS

Thinking Serverless at Scale

Analogy: Highway system. Each service has a lane capacity (quota). If one lane jams (throttle), traffic backs up. You need to know each road's speed limit and plan alternate routes.

API Gateway - Managing Scale

| Feature | What It Does | Analogy |
| --- | --- | --- |
| Account Quota | 10,000 requests/sec across all APIs (default) | Highway speed limit for your region |
| Burst Capacity | 5,000 requests immediate burst (token bucket) | Passing lane - short bursts allowed |
| Stage Throttling | Per-stage rate/burst limits | Speed limit per road segment |
| Route/Method | Per-route throttle (HTTP) or per-method (REST) | Speed bumps on specific streets |
| Usage Plans | Per-client throttle + monthly quota (API keys) | Toll pass with monthly limit per driver |
| Response Cache | REST APIs: cache responses to reduce backend calls | Saved answer - don't re-ask the same question |

Throttles are evaluated from most specific to least specific: Client (usage plan) > Route/Method > Stage > Account
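To make the token-bucket idea concrete, here is a rough sketch in one-second steps. `simulate_bucket` is a hypothetical helper, not an AWS tool, and the one-second granularity is a simplifying assumption; the real service meters requests continuously.

```shell
# simulate_bucket RATE BURST ARRIVALS_PER_SECOND...   (hypothetical helper)
# The bucket starts full with BURST tokens and gains RATE tokens each second.
# Each accepted request spends one token; requests beyond that are throttled.
simulate_bucket() {
  local rate=$1 burst=$2; shift 2
  local tokens=$burst second=0 arrivals available accepted
  for arrivals in "$@"; do
    second=$((second + 1))
    available=$((tokens + rate))          # carried-over tokens plus this second's refill
    if [ "$arrivals" -le "$available" ]; then
      accepted=$arrivals
    else
      accepted=$available
    fi
    tokens=$((available - accepted))
    if [ "$tokens" -gt "$burst" ]; then   # leftover tokens never exceed the burst size
      tokens=$burst
    fi
    echo "second=$second accepted=$accepted throttled=$((arrivals - accepted))"
  done
}

# Default account values from the table: 10,000 req/sec rate, 5,000 burst.
simulate_bucket 10000 5000 4000 9000 12000 16000 11000
```

In this run the first three seconds pass untouched because the burst allowance absorbs the spike; throttling starts only once the bucket is drained.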

Lambda Concurrency Scaling

Concurrency = Requests per second x Duration (seconds)

Provisioned (always warm): zero cold starts
On-Demand (auto-scales): cold starts possible
Throttled (429): over quota = rejected

Key Quotas:
- Regional concurrency: 1,000 (default, can request increase)
- Concurrency Scaling Rate: limits how FAST concurrency can spike up
- Reserved: guarantees capacity for one function and caps it at that number (subtracts from shared pool)
Provisioned = reserved seats on a plane (always yours). On-demand = standby (might wait). Throttled = flight is full, come back later.
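Because reserved concurrency is subtracted from the shared pool, it helps to check what is left for every other function. A minimal sketch, assuming a hypothetical helper named `unreserved_pool`; the 100-execution floor reflects Lambda's requirement to keep at least 100 unreserved.

```shell
# unreserved_pool ACCOUNT_LIMIT RESERVED...   (hypothetical helper)
# Prints the concurrency left in the shared pool after all reservations.
unreserved_pool() {
  local pool=$1; shift
  local reserved
  for reserved in "$@"; do
    pool=$((pool - reserved))
  done
  if [ "$pool" -lt 100 ]; then
    echo "warning: Lambda keeps at least 100 unreserved; this plan would be rejected" >&2
  fi
  echo "$pool"
}

# Default regional limit with two functions reserving 100 and 250:
unreserved_pool 1000 100 250   # prints 650
```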

Function Duration Impacts Concurrency & Cost

| Duration | 10 req/sec needs | 100 req/sec needs | Insight |
| --- | --- | --- | --- |
| 100ms | 1 concurrent | 10 concurrent | Low concurrency, low cost |
| 1 sec | 10 concurrent | 100 concurrent | Moderate |
| 10 sec | 100 concurrent | 1,000 concurrent | Hits default quota! |
| 60 sec | 600 concurrent | 6,000 concurrent | Way over default limit |
Concurrency = seats in a restaurant. If each customer stays 10 minutes, you need 10 seats for 1 customer/minute. If they stay 60 minutes, you need 60 seats for the same rate. Shorter functions = fewer seats needed = cheaper.
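The table values follow directly from the formula. A small sketch that reproduces them; `required_concurrency` is a hypothetical helper, and it takes the duration in milliseconds so the arithmetic stays in integers.

```shell
# required_concurrency REQUESTS_PER_SEC DURATION_MS   (hypothetical helper)
# Concurrency = requests/sec x duration in seconds, rounded up to a whole instance.
required_concurrency() {
  local rps=$1 duration_ms=$2
  echo $(( (rps * duration_ms + 999) / 1000 ))
}

for duration_ms in 100 1000 10000 60000; do
  echo "${duration_ms}ms: 10 rps -> $(required_concurrency 10 "$duration_ms")," \
       "100 rps -> $(required_concurrency 100 "$duration_ms")"
done
```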

Scaling with Sync & Async Sources

Synchronous (API Gateway)
- Client waits for response
- Throttled? Client gets 429 immediately
- No built-in retry (client must retry)
- User feels the throttle directly!

Asynchronous (SNS/EventBridge)
- Client gets 202 immediately
- Throttled? Lambda retries up to 6 hours
- Events queued internally
- User doesn't feel throttle!
Sync = drive-through (you wait in line, feel every delay). Async = online order (submit and go, they process when ready).
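Since the synchronous path has no built-in retry, the client has to supply one. The following is a minimal sketch of retry with exponential backoff; `retry_with_backoff` and the `BACKOFF_BASE` variable are hypothetical names for this illustration, and a production client would usually add random jitter as well.

```shell
# retry_with_backoff MAX_ATTEMPTS COMMAND [ARGS...]   (hypothetical helper)
# Re-runs COMMAND until it succeeds, doubling the wait after each failure.
# BACKOFF_BASE is an assumed knob: the first wait in seconds (default 1).
retry_with_backoff() {
  local max_attempts=$1; shift
  local attempt=1 delay=${BACKOFF_BASE:-1}
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Example (commented out so the sketch runs offline): curl --fail exits
# non-zero on HTTP 4xx/5xx responses, which includes a 429 throttle.
# retry_with_backoff 5 curl --fail --silent \
#   https://API_ID.execute-api.us-west-2.amazonaws.com/prod/items
```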

Scaling with SQS Event Source

Tuning for Scale

| Setting | Impact on Scale |
| --- | --- |
| Batch size (1-10,000 for standard queues; FIFO max 10) | Larger batch = fewer invocations needed |
| Batch window (0-5 min) | Wait to fill batch = fewer invocations, higher latency |
| Visibility timeout | Set to at least 6x the function timeout to avoid duplicate processing |
| Max concurrency (2-1,000) | Cap to protect downstream services |
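The visibility-timeout rule can be worked out mechanically. The sketch below assumes the common guidance of six times the function timeout plus the batch window; both function names and the `AWS_CLI` override (which lets you dry-run with `AWS_CLI=echo`) are hypothetical conveniences, while `sqs set-queue-attributes` is the real CLI call.

```shell
# recommended_visibility_timeout FUNCTION_TIMEOUT_SEC [BATCH_WINDOW_SEC]   (hypothetical helper)
recommended_visibility_timeout() {
  local fn_timeout=$1 batch_window=${2:-0}
  echo $(( 6 * fn_timeout + batch_window ))
}

# apply_visibility_timeout QUEUE_URL SECONDS   (hypothetical wrapper)
# AWS_CLI is an assumed override for dry runs; it defaults to the real aws binary.
apply_visibility_timeout() {
  ${AWS_CLI:-aws} sqs set-queue-attributes \
    --queue-url "$1" \
    --attributes "VisibilityTimeout=$2"
}

# A 30-second function timeout with a 5-second batch window:
recommended_visibility_timeout 30 5   # prints 185
```

Running `AWS_CLI=echo apply_visibility_timeout QUEUE_URL 185` prints the command instead of executing it, which is handy when rehearsing.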

Scaling with Kinesis Data Streams

Concurrency = Shards x Parallelization Factor

Example: 3 shards (Shard 1, Shard 2, Shard 3) x Parallelization Factor 2 = 6 concurrent Lambda instances

Scale Levers:
1. Add shards (more ingest)
2. Increase parallelization (1-10)
3. Increase batch size
4. Add batch window
5. Enhanced fan-out
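The shard arithmetic is simple enough to check before changing a stream. A minimal sketch, with `kinesis_concurrency` as a hypothetical helper that also enforces the 1-10 range for the parallelization factor:

```shell
# kinesis_concurrency SHARDS PARALLELIZATION_FACTOR   (hypothetical helper)
kinesis_concurrency() {
  local shards=$1 factor=$2
  if [ "$factor" -lt 1 ] || [ "$factor" -gt 10 ]; then
    echo "parallelization factor must be between 1 and 10" >&2
    return 1
  fi
  echo $(( shards * factor ))
}

kinesis_concurrency 3 2    # prints 6
kinesis_concurrency 3 10   # prints 30

# The factor itself is set on the event source mapping, for example:
#   aws lambda update-event-source-mapping --uuid MAPPING_UUID --parallelization-factor 2
# (MAPPING_UUID is a placeholder for your mapping's ID.)
```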

Enhanced Fan-Out for Multiple Consumers

| Feature | Standard Consumer | Enhanced Fan-Out |
| --- | --- | --- |
| Throughput | 2 MB/sec per shard SHARED across all consumers | 2 MB/sec per shard PER consumer (dedicated) |
| Delivery | Pull (poll-based) | Push (SubscribeToShard) |
| Latency | ~200ms avg | ~70ms avg |
| Best for | 1-2 consumers, cost-sensitive | 3+ consumers, low latency critical |
Standard = shared TV antenna (split signal gets weaker per viewer). Enhanced fan-out = dedicated cable line per household (full bandwidth each).

Metrics That Indicate Scaling Issues

| Service | Metric | What It Means |
| --- | --- | --- |
| Lambda | Throttles | Hitting concurrency limit - increase quota or add reserved |
| Lambda | Duration (p99) | Approaching timeout - optimize or increase memory |
| Lambda | ConcurrentExecutions | Near quota - time to request increase |
| SQS | ApproximateAgeOfOldestMessage | Growing = processing can't keep up |
| SQS | ApproximateNumberOfMessagesVisible | Queue depth growing = add concurrency |
| Kinesis | GetRecords.IteratorAgeMilliseconds (Lambda: IteratorAge) | Growing = consumer falling behind producer |
| Kinesis | ReadProvisionedThroughputExceeded | Need more shards or enhanced fan-out |
| API GW | 4XXError / 5XXError | Clients being throttled or backend failing |
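These metrics are most useful with an alarm attached. As one possible sketch, the wrapper below raises an alarm on any Lambda throttle within a minute; `create_throttle_alarm`, the alarm naming scheme, and the `AWS_CLI` dry-run override are assumptions for illustration, while the `put-metric-alarm` flags are the standard CLI ones.

```shell
# create_throttle_alarm FUNCTION_NAME   (hypothetical wrapper)
# Alarms when the function records more than 0 throttles in a 60-second period.
# AWS_CLI is an assumed override for dry runs (e.g. AWS_CLI=echo).
create_throttle_alarm() {
  ${AWS_CLI:-aws} cloudwatch put-metric-alarm \
    --alarm-name "$1-throttles" \
    --namespace AWS/Lambda --metric-name Throttles \
    --dimensions "Name=FunctionName,Value=$1" \
    --statistic Sum --period 60 --evaluation-periods 1 \
    --threshold 0 --comparison-operator GreaterThanThreshold \
    --treat-missing-data notBreaching
}

# Usage (commented out so the sketch does not call AWS when pasted):
# create_throttle_alarm my-api-handler
```

No alarm action is attached here; adding `--alarm-actions` with an SNS topic ARN would be the usual next step.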

What's New (2024-2025)

Q1: What is the formula for Lambda concurrency?

B) Requests/sec x Duration(sec)
10 req/sec with 1s duration = 10 concurrent. Shorter functions = less concurrency needed = less cost.
A: Memory doesn't affect concurrency count. C: That's Kinesis-specific. D: Those are config, not concurrency formula.

Q2: A Lambda function with SQS source is hitting throttles. What should you do FIRST?

C) Request concurrency quota increase
If the function is throttling and no reserved-concurrency cap is set on it, you've hit the regional limit (default 1,000). Request an increase via Service Quotas.
A: Memory doesn't affect concurrency quota. B: Max Concurrency LIMITS scaling (opposite of what you want). D: DLQ handles failures, unrelated to throttle.

Q3: How do you increase Kinesis stream processing throughput?

B) Add shards + increase parallelization factor
More shards = more ingest capacity + more concurrent Lambda instances. Parallelization (1-10) multiplies concurrency per shard.
A: Memory helps speed per invocation, not throughput at stream level. C: API GW isn't involved with Kinesis. D: SQS is a different source.

Q4: Why does async invocation reduce throttle impact on clients?

B) Client gets 202; Lambda handles retries internally
The client doesn't wait or see the throttle. Lambda queues the event and retries for up to 6 hours until concurrency is available.
A: Same concurrency limits apply. C: Same function, same speed. D: IAM always applies.

Live Demo: Load Testing & Observing Scale

Step 1: Deploy a function with reserved concurrency

aws lambda put-function-concurrency \
  --function-name my-api-handler \
  --reserved-concurrent-executions 10

Step 2: Generate load (exceed the limit)

# Install artillery for load testing
npm install -g artillery

# Quick load test: 20 virtual users, each sending 30 requests (600 total)
artillery quick --count 20 --num 30 \
  https://API_ID.execute-api.us-west-2.amazonaws.com/prod/items

Step 3: Observe in CloudWatch

# Sum throttles per minute over the last 5 minutes
# (date -d is GNU date; on macOS use: date -u -v-5M +%Y-%m-%dT%H:%M:%S)
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda --metric-name Throttles \
  --dimensions Name=FunctionName,Value=my-api-handler \
  --start-time $(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 60 --statistics Sum

Demo: What to Show

| Action | What Students See |
| --- | --- |
| Set reserved=10, run the 20-user load test | Throttles appear, 429 errors in client |
| Increase reserved to 50 | Throttles disappear, all requests succeed |
| Remove reserved, show account pool | Scales freely up to 1,000 |
| Add provisioned=10 | First 10 are instant (no cold start), rest have cold starts |
| Show CloudWatch metrics | ConcurrentExecutions, Throttles, Duration graphs |
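The "Add provisioned=10" row can be run with a command along these lines. This is a sketch, assuming the function has a published alias named `prod` (provisioned concurrency cannot target `$LATEST`); `add_provisioned` and the `AWS_CLI` dry-run override are hypothetical names.

```shell
# add_provisioned FUNCTION_NAME ALIAS COUNT   (hypothetical wrapper)
# AWS_CLI is an assumed override for dry runs (e.g. AWS_CLI=echo).
add_provisioned() {
  ${AWS_CLI:-aws} lambda put-provisioned-concurrency-config \
    --function-name "$1" \
    --qualifier "$2" \
    --provisioned-concurrent-executions "$3"
}

# Usage (commented out so the sketch does not call AWS when pasted):
# add_provisioned my-api-handler prod 10
```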

Cleanup

aws lambda delete-function-concurrency --function-name my-api-handler
aws lambda delete-provisioned-concurrency-config \
  --function-name my-api-handler --qualifier prod

Module Summary